A compendium of bacterial and archaeal single-cell amplified genomes from oxygen deficient marine waters

Oxygen-deficient marine waters referred to as oxygen minimum zones (OMZs) or anoxic marine zones (AMZs) are common oceanographic features. They host both cosmopolitan and endemic microorganisms adapted to low oxygen conditions. Microbial metabolic interactions within OMZs and AMZs drive coupled biogeochemical cycles resulting in nitrogen loss and climate active trace gas production and consumption. Global warming is causing oxygen-deficient waters to expand and intensify. Therefore, studies focused on microbial communities inhabiting oxygen-deficient regions are necessary to both monitor and model the impacts of climate change on marine ecosystem functions and services. Here we present a compendium of 5,129 single-cell amplified genomes (SAGs) from marine environments encompassing representative OMZ and AMZ geochemical profiles. Of these, 3,570 SAGs have been sequenced to different levels of completion, providing a strain-resolved perspective on the genomic content and potential metabolic interactions within OMZ and AMZ microbiomes. Hierarchical clustering confirmed that samples from similar oxygen concentrations and geographic regions also had analogous taxonomic compositions, providing a coherent framework for comparative community analysis.

North Pacific subtropical gyre present dysoxic conditions capable of supporting anaerobic metabolism through microbial remineralization of sinking particulate organic matter 3 (Fig. 1a). Low oxygen coastal and open ocean OMZs such as the Northeastern Subarctic Pacific (NESAP) present suboxic conditions encompassing the redox transition for nitrate (NO 3 −) reduction (Fig. 1a). Anoxic marine zones (AMZs) are further differentiated by nitrite (NO 2 −) accumulation with or without sulfide accumulation (sulfidic bottom waters and open ocean  4 . Solid lines represent observed data, while the dashed line represent a sporadic accumulation event. (b) OMZ and AMZ sampling locations for single-cell amplified genomes (SAGs) are indicated. The total number (white) and sequenced (black) SAGs obtained from each location are denoted with a circle proportional to the corresponding number of samples in the dataset. The Ocean is coloured according to the lowest mean statistical value for the oxygen concentration reported for each 1° and 5° grid in the 2018 annual NOAA World Ocean Atlas 119 , with white grids indicating locations where oxygen concentration data was unavailable. Sampling sites from oceanic midwaters include the North Pacific Subtropical Gyre (NPSG) and the South Atlantic Subtropical Gyre (SASG). Sample sites from low oxygen OMZs include the Northeastern Subarctic Pacific (NESAP). Sample sites from AMZs include the Eastern Tropical North Pacific Gyre (ETNP) and Eastern Tropical South Pacific Gyre (ETSP). Sites from coastal upwelling systems with ephemerally sulfidic bottoms include the Eastern South Pacific Coastal Upwelling (ESPCU) and Benguela coastal upwelling (Benguela). Sampling sites from sulfidic bottom basins include Saanich Inlet (SI) and the Baltic Sea. Geolocalization coordinates and the number of samples for each location are detailed in Table 1. or low-oxygen minimum zones (OMZs), respectively) [4][5][6] . For example, AMZs in the Eastern Tropical North Pacific (ETNP) and Eastern Tropical South Pacific (ETSP) present nanomolar oxygen conditions supporting NO 3 − reduction to NO 2 − and further reduced nitrogen products without hydrogen sulfide (H 2 S) accumulation (Fig. 1a). In contrast, coastal upwelling systems such as Benguela upwelling off the coast of Namibia present episodic shifts in oxygen deficiency, supporting the emergence of transient sulfidic plumes (Fig. 1a). Anoxic sulfidic conditions are also present in coastal fjords, such as the Saanich Inlet (SI), where glacial sills restrict water mass circulation. Sulfidic bottom conditions are also observed in marginal seas, such as the Baltic Sea (Fig. 1a).
Different geochemical profiles within OMZs and AMZs create ecothermodynamic gradients 7 driving coupled biogeochemical cycling of carbon, nitrogen and sulphur by cosmopolitan and endemic microorganisms adapted for life under low oxygen conditions (reviewed in [2][3][4]8 ). Understanding how these metabolic interactions contribute to nitrogen loss and climate active trace gas production is a critical challenge 9-12 . Global warming exacerbates water column oxygen deficiency through thermal stratification and changes in water mass circulation, resulting in OMZ and AMZ expansion and intensification [13][14][15] . Other factors, including excessive nutrient inputs (eutrophication), also contribute to coastal and marginal sea oxygen deficiency [15][16][17][18] . Efforts to model coupled biogeochemical cycles within OMZs and AMZs using both gene-centric and genome-resolved metagenomic approaches have identified key microbial populations that would benefit directly from availability of improved genome assemblies with increased taxonomic resolution 19,20 .
Cultivation-independent whole genome shotgun sequencing provides direct insights into microbial community structure and function in natural and engineered environments [21][22][23][24][25][26][27] . As sequencing technologies improve, it becomes possible to assemble genomes from metagenomes with increasing taxonomic resolution 20 . However, despite an expanding reliance on metagenome-assembled genomes (MAGs), several challenges remain, including resolving population microheterogeneity 28 , incomplete or chimeric genome assemblies (resulting from either assembly or binning), coverage bias, and limited availability of taxonomically characterized reference genomes for cross-validation [29][30][31] . Advances in fluorescence-activated cell sorting (FACS) and sequencing technologies enable study of uncultivated microorganisms at the individual cell level, providing more accurate taxonomic labels and associated mobile genetic elements (MGEs) [32][33][34][35][36][37][38] . Resulting single-cell amplified genomes (SAGs) and MGEs have been used to illuminate coding potential of "microbial dark matter" 39 , provide accurate linkages between taxonomy and function underlying biogeochemical cycles 20,21,30,40 , and to evaluate genome streamlining 41 , fine scale population structure 28,37,42 and virus-host dynamics 43 . Recent release of the Global Ocean Reference Genomes Tropics, or GORG-Tropics provides a valuable compendium of taxonomically defined SAGs containing >12,000 partial genome sequences from tropical and subtropical euphotic ocean waters 44 . Although a small subset of GORG-Tropics SAGs were collected from 'oceanic midwater low oxygen' waters (2,136 of 20,288 sequenced SAGs) 44 , oxygen-deficient marine waters remain conspicuously underrepresented, considering their substantial biogeochemical impact on marine ecosystem functions and services.
Here, we present a global compendium of bacterial and archaeal SAGs from OMZs and AMZs. This compendium contains 5,129 taxonomically identified SAGs derived using a combination of targeted and untargeted cell sorting methods, and isolated from environments covering the full range of geochemical profiles associated with extant, oxygen-deficient marine waters 4 (Fig. 1). Currently, 3,570 of these SAGs have been sequenced, assembled and decontaminated, based on established genomic standards 45 (Fig. 2, S1a-c). Sequenced and assembled SAGs were processed through the Microbial Genome Annotation Pipeline 46 for gene prediction and functional annotation, and are available through the Integrated Microbial Genome platform (IMG; https://img.jgi.doe.gov/) 47 or IMG/ProPortal (https://img.jgi.doe.gov/proportal). The collection of SAG sequences provides an invaluable resource to infer metabolic traits, resolve population structures, and assess spatial and temporal trends of relevant taxonomic lineages within OMZ and AMZ microbiomes.

Methods
Sample collection and cryopreservation. Approximately 1-2 mL seawater samples were collected in duplicate or triplicate during various oceanographic cruises within different OMZs and anoxic waters ( Fig. 1 and Table 1). Samples were placed in sterile cryovials and amended with one of the following cryoprotectants: glycine betaine (6% [v/v] final concentration 39,48 ), glycerol (10% [v/v] final concentration 28,49 ), or glycerol-TE buffer 39,50 . Environmental seawater collection was performed using a Niskin-bottle rosette, or a Pump Profiling System for the NBP13-05 cruise (ETSP; R/V Nathaniel B. Palmer, July 5-7th, 2013), equipped with a conductivity-temperature-depth profiler, dissolved O 2 sensor, fluorometer and transmissometer. A modified sample collection protocol was used during the BiG RAPA cruise (ETSP, off the coast of Chile, November 19th 2010, 55 m depth) which was first enriched on-deck selecting for chlorophyll-containing microorganisms 51 . Triplicate samples were passed through a 60 μm size mesh and sorted through an Influx TM (BD Biosciences) flow cytometer system. Approximately 4,000 cells were sorted into 1 mL of sterile glycerol-TE buffer. Sorting was triggered based on the pigment content of particles in the red emission channel (excited by the 488 laser), using forward scattered light as a proxy for particle size. All samples were cryopreserved in liquid nitrogen and then stored at −80 °C, before being processed for single-cell amplified genome generation.
Microbial isolation and Single-cell Amplified Genome (SAG) generation. Samples were thawed and microbial cells sorted at the Bigelow Laboratory for Ocean Sciences' Single Cell Genomics Center (SCGC) or the Joint Genome Institute (JGI). Samples were passed through a sterile 40 μm size mesh before microorganisms were sorted by either a non-targeted isolation procedure or specific selection for cyanobacteria. For non-target isolation, the microbial particles were labelled with a 5 μM final concentration of the DNA stain SYTO-9 (Thermo Fisher Scientific). Microbial cells were individually sorted using a MoFlo TM (Beckman Coulter) or an InFlux TM (BD Biosciences) flow cytometer system equipped with a 488 nm laser for excitation and a 70 μm nozzle orifice 52 . The gates for the untargeted isolation of microbial cells stained with SYTO-9 were defined based on the green www.nature.com/scientificdata www.nature.com/scientificdata/ fluorescence of particles as a proxy for nucleic acid content, and side scattered light as a proxy for particle size. For isolation of cyanobacterial cells, gates were defined based on autofluorescence in the red emission channel. An improved discrimination of cyanobacterial cells from detrital particles was performed based on the ratio of green (SYTO-9 DNA label) versus red (chlorophyll content) fluorescence. The cytometer was triggered on the side-scatter using the "single-1 drop" mode. All microbial single-cells were sorted into 384-well plates containing 600 nL of 1X TE buffer per well and then stored at −80 °C until further processed. A subset of microbial cells, that generated the SAGs identified with the ' AAA001' prefix (part of the SAGs collected at the SASG, 800 m depth), were sorted into 'prepGEM TM Bacteria reaction mix' (ZyGEM) 48 . For samples processed in the Bigelow Single Cell Genomics Center, 64 of the 384-wells on each plate were used as negative controls (no droplet deposition), and 3 wells received 10 cells each to serve as positive controls.
The microbial single-cells sorted into TE buffer were lysed as described previously by adding either cold KOH 53 , or 700 nl of a lysis buffer consisting of 0.4 mM KOH, 10 mM EDTA and 100 mM dithiothreitol 52 . Samples were incubated for 10 min at either 4 or 20 °C for samples lysed with cold KOH or lysis buffer, respectively. Microbial cells sorted into 'prepGEM TM Bacteria reaction mix' were first lysed following the manufacturer's instructions and then processed through the cold KOH lysis procedure 48 . The microbial nucleic acids were then whole genome amplified in individual wells through either through traditional Phi29-mediated "Legacy Multiple Displacement Amplification" (L-MDA 39,53 ) or using a more thermostable Phi29 polymerase via "Whole Genome Amplification-X" (WGA-X 52 ). The products of this procedure are here referred to as SAGs.
Two methods were used to assign taxonomy to the SAGs. Initially, taxonomic assignments for SAGs generated through L-MDA were conducted by extracting SSU rRNA gene sequences directly from the whole genome assemblies, or from the amplicons described above. For all SAGs generated through the WGA-X procedure that were not screened for any phylogenetic marker prior to genome sequencing, a search was conducted to identify www.nature.com/scientificdata www.nature.com/scientificdata/ SSU rRNA gene sequences > 500 bp within the genome assembly (Supplemental Figure S2). This search was performed through the Integrated Microbial Genomes & Microbiome system (IMG/M, https://img.jgi.doe.gov/m/) based on its gene prediction and annotation pipeline (see below) 45,46 . Additionally, SSU rRNA sequences were recovered from a subset of SAG assemblies with the Recovering ribosomal RNA gene sequences workflow with Anvi' o v7.0 62 . These SSU rRNA gene sequences were processed as described above to assign taxonomy (Table  S1) 60 . Because 1,281 SAGs did not provide sufficient SSU rRNA gene sequence information (Table S2) 60 , all SAG assemblies were also processed through the Genome Taxonomy Database Tool Kit GTDB-Tk v2.1.0 [63][64][65][66][67][68][69][70] with GTDB R07-R207_v2 71-73 reference data for multi-locus taxonomic assignment. This allowed for taxonomic identification of SAGs missing SSU rRNA gene sequences, and offered an additional reference compared to those assigned by partial or complete phylogenetic marker sequences. The number of taxonomic assignments that were generated using both methods are detailed in Table S3 60 , with the assignments being available in Table S1 60 . www.nature.com/scientificdata www.nature.com/scientificdata/ Genome sequencing, de novo assemblies and decontamination. SAGs were sequenced as described previously 39,52 , and their reads assembled into contigs using SPAdes v2.2.10 to v3.10.0 74 . Contigs of <2,000 bp were removed from SAG assemblies. Completeness and contamination levels of SAG assemblies were then determined using CheckM v1.2.1 75 . To comply with established genomic standards 45 , assemblies exceeding 5% estimated contamination were run through ProDeGe v2.2 to v2.3 76 to eliminate the conflicting contigs until there was no improvement in their contamination estimates. The contamination and completeness levels for these SAGs were then re-evaluated using CheckM v1.2.1 75 and those that still exceeded 5% contamination were manually decontaminated through the Metagenomics Workflow and Refining MAG bin workflows available in Anvi' o v5 62 .
Manual decontamination of Saanich inlet SAGs. A total of 14 SAGs exceeded 5% contamination after being processed through the ProDeGe decontamination pipeline 76 and short-contig trimming (Table S5) 60 . These SAGs were manually decontaminated with Anvi' o v5 using the Metagenomics Workflow and Refining MAG bin workflows 62 . A contig database and the corresponding Hidden Markov Model for each database was generated for each of these SAGs. The taxonomy for each gene was then assigned using the Centrifuge Database 77 . Additional manual curation of these SAGs was carried out using differential coverage of each SAG based on metagenomic reads from Saanich Inlet metagenomes (August 2011 100 m, 150 m, and 2012 100 m, 150 m. Biosamples SAMN05224439, SAMN05224444, SAMN05224441, SAMN05224518, BioProject PRJNA247822) 78 . Raw metagenomic reads were mapped with bwa v0.7.17-r1188 79 and samtools v1.6-19-g1c03df6 80 . Anvi profile databases were generated for each SAG by utilizing the contig databases and the read mapping files. Individual contigs were manually removed through the interactive interface based on taxonomic identity, average tetranucleotide identity, and low differential coverage. The new assemblies were exported as fasta files and re-assessed with CheckM.

SAG quality classification.
After CheckM was run on all decontaminated SAG assemblies, the quality of each SAG was determined based on Bowers et al. 2017 45 . SAGs that were <50% estimated completeness were considered low quality SAGs. SAGs that had ≥50% estimated completeness and <10% estimated contamination were considered to be at least medium quality. To determine if a SAG was high quality, in addition to having >90% estimated completeness and <5% estimated contamination, SAGs need to have 23 S, 16 S, and 5 S rRNA genes and at least 18 tRNAs present in the final assembly. To identify and quantify the rRNAs and tRNAs, SAGs were passed through Barrnap v0.9 (https://github.com/tseemann/Barrnap) 81 and tRNA-SE v2.0.11 82 respectively. Any SAGs having >90% estimated completeness and <5% estimated contamination but missing one or more rRNA genes with at least 18 tRNAs were classified as medium quality. The rRNA and tRNA counts, as well as Quality classifications for each SAG can be found Table S1 60 .
Genome annotation. All genome assemblies were annotated through the Joint Genome Institute's IMG platform and annotated using the JGI Microbial Genome Annotation Pipeline 46 . The IMG (https://img.jgi.doe.gov/) or IMG/ProPortal (https://img.jgi.doe.gov/proportal) systems host all final assembled and decontaminated SAG sequences, with gene calls and functional annotations publicly available through these portals. All IMG accession numbers for sequenced SAGs are provided (Table S1) 60 .
Hierarchical clustering. The recovered SSU rRNA gene amplicon sequences covering the V4-V5 variable region were clustered at 97% identity using CD-Hit [83][84][85] , and assigned identifiers based on a representative sequence from each cluster. Based on the taxonomic identity of these representative sequences, the proportion of SAGs associated with each cluster was determined on a per sample basis. These proportions were used to calculate Bray-Curtis Dissimilarity indices using the vegdist() command in the vegan R package v2.5-7 86 . The samples were clustered based on Bray-Curtis dissimilarity, using an average linking method for hierarchical clustering using the hclust command in base R and visualized (Fig. 3).    www.nature.com/scientificdata www.nature.com/scientificdata/ technical Validation Early implementation of SAG workflows involved MDA of anonymously sorted single cells in 384-well plate format followed by PCR amplification of selected phylogenetic markers e.g., SSU rRNA gene, to identify SAGs of interest for sequencing 39 . Recent development of WGA-X coupled with low-coverage genome sequencing (LoCoS) provides a more economical workflow to identify hundreds of SAGs per sample without potential PCR bias 52 . Targeted methods of sorting based on spectral properties of cells or substrates have also been applied to SAG selection and sequencing, including cyanobacteria and cells binding to fluorescently labelled substrates, such as cellulose [87][88][89] . Although the SSU rRNA gene remains one of the most extensively used phylogenetic markers and has well-established and curated databases (e.g. SILVA 58 ), multi-locus phylogenetic assignment tools, such as GTDB-Tk 63-70 generate equally valid results using more information. For this compendium, microbial diversity was assessed using taxonomic labels and abundance information for SAGs sequenced using non-targeted cell-sorting approaches. However, not all SAGs had a match for their SSU rRNA gene taxonomy due to either their amplicon sequences being too short (188 L-MDA SAGs with < 500 bp amplicons; Table S2) or no SSU rRNA gene was recovered from the random genome amplification (i.e. 1,093 WGA-X SAGs; Table  S2). Thus, an additional taxonomic classifier was run for all SAGs, based on whole-genome assignment using GTDB-Tk [63][64][65][66][67][68][69][70] . Both sets of classifications are in Table S1. Hierarchical clustering (of 3,217 SAGs that contained an assignable V4-V5 SSU rRNA gene amplicon sequence) revealed a higher similarity among those from depths and geographic locations with similar oxygen conditions (Fig. 3), a result consistent with prior observations [2][3][4]8 . It should also be noted that many of the SAGs amplified with the WGS-X method originated from highly oxygenated samples, which had similar taxonomic compositions and therefore clustered together. Based on this information, the OMZ and AMZ SAG sequences presented here should serve to complement previous SAG collections obtained from (oxygenated) euphotic ocean waters 44,49 .

Usage Notes
This compendium is intended to fill a critical gap in taxonomically labelled reference genomes from marine oxygen-deficient waters. Included SAG sequences were processed using well-established assembly and decontamination workflows. However, links to the raw data are also available for users interested in using future software versions or implementing alternative workflows. Any approach should aim to discern contaminating sequences associated with FACS (co-sorting two or more cells into a single well, environmental DNA contamination) and WGA (reagent contamination) 90 . It is important to emphasize that SAG sequences often contain MGEs, including plasmids and viruses 43,[91][92][93] . These sequences are typically filtered out during the decontamination process, although differentiating between endogenous chromosomal intervals such as islands or prophage from MGEs requires careful manual curation. Users interested in MGEs are encouraged to work with the raw data or initial assemblies prior to decontamination. Note that genome assembly contamination estimates obtained by CheckM should be handled with caution, as this tool is prone to both over and under estimating completeness and contamination 94 . As described above, recent advances in MDA using WGA-X have led to improved SAG completion and the adoption of LoCoS has obviated the need for SSU rRNA gene amplicon screening to select SAGs of interest for sequencing 52 . The sequences included in this compendium include both older and more contemporary SAG sequencing approaches. The results are integrated by presenting SSU rRNA gene and multi-locus taxonomic assignments based on SILVA, NCBI, and GTDB.
Despite improvements introduced with WGA-X 52 , single-cell genomics invariably results in incomplete genome assemblies (Fig. 4, Supplemental Figure S4). This limitation can be overcome in part when multiple SAGs sharing extremely high levels of nucleotide identity are obtained from the same sample. Such closely related sequences can be analysed together, enabling more complete metabolic reconstruction 7,33,42,51 , or used to generate combined assemblies 50,95 . In addition, population-level genomes can be obtained through hybrid assemblies combining SAG sequences and metagenomic sequences 96,97 . In all cases, SAG contigs should be quality filtered to eliminate the presence of contaminating sequences and comply with established genomic standards 45 . All SAG assemblies reported here were thoroughly decontaminated, reaching <5% contamination for all except four SAGs (that only had between 5-10% contamination; Fig. 4, Supplemental Figure S4, Table S2).
Many SAGs included in this compendium have not been sequenced, and the DNA remains in storage. Users are encouraged to identify underrepresented microorganisms from OMZ and AMZ microbiomes based on the provided taxonomic information that can be prioritized for sequencing and shared with the user community. At the same time, we recognize that there are also underrepresented OMZ and AMZ environments not included in this compendium. Sequences from the Black Sea, South China Sea, Arabian Sea and Bay of Bengal, among others, would provide a more robust representation of oxygen-deficient marine waters for use in comparative studies and modelling efforts. Finally, the SAG sequences included in this compendium can be used as taxonomically characterized reference genomes to recruit metagenomic data sets from marine environments, improve pathway prediction methods 29,98-106 or expand reference packages for gene-centric analysis of functional markers 107 .